These slides illustrate different methods of generating predictions from multivariate models. Generating quantities of interest from models is a key part of the modeling process. We can use these predictions to understand the model, and to evaluate counterfactuals.
Quantities of interest generally include a prediction of \(y\) for a given set of \(x\) values, and the uncertainty around that prediction. We can also compute the marginal effects of \(x\) on \(y\), and the uncertainty around those marginal effects.
Here, we’ll examine three methods:
At-means predictions (adjusted effects)
Average effects
Simulated effects
These methods are all flexible and adaptable, and are what we commonly use in linear (OLS) and non-linear (MLE) models. So you’ll see these methods quite in quantitative social science modeling.
Major League Baseball Attendance
We’ll use a dataset of Major League Baseball attendance to illustrate different methods of generating predictions from multivariate models.
code
mlb <-read.csv("/users/dave/documents/teaching/501/2023/slides/L3_multivariate/code/mlbattendance/MLBattend.csv") # rescale home attendancemlb$Home_attend <- mlb$Home_attend/1000datasummary(All(mlb) ~ mean+ median+ sd+min+max, data=mlb, style="grid", title ="MLB attendance data")
MLB attendance data
mean
median
sd
min
max
Season
1985.06
1985.00
9.27
1969.00
2000.00
Home_attend
1777.99
1681.90
755.87
306.76
4483.35
Runs_scored
694.94
691.50
105.17
329.00
1009.00
Runs_allowed
694.89
693.00
105.52
331.00
1103.00
Wins
78.85
79.00
12.67
37.00
114.00
Losses
78.88
79.00
12.65
40.00
110.00
Games_behind
14.39
13.00
11.75
0.00
52.00
Bivariate model, predictions
Here’s a simple model predicting season attendance at MLB games, regressing home attendance on runs scored, then generating “in-sample” predictions by computing:
Here’s a model of attendance at MLB games, regressing home attendance on runs scored, runs allowed, season (year), and games behind, then generating “in-sample” predictions.
code
m3 <-lm(data=mlb, Home_attend ~ Runs_scored + Runs_allowed+ Season +Games_behind)# modelsummary(m3)stargazer(m3, type="html", title="Multivariate model of MLB attendance")
As you can see, the predictions are pretty ugly because all four \(x\) variables are varying by observation, so we’re getting predictions based on each individual observation’s characteristics - interesting if we care about the 1977 Red Sox, but not something we can draw generalities from. What’s missing here?
Control
Hold all variables constant at means, medians, or modes, except the one of interest, so we can focus on how changes in the \(x\) of interest affect \(\widehat{y}\).
Prediction Methods for Multivariate Models
There are lots of ways to generate predictions - this covers three common and flexible types. In general, these are “out of sample” methods - they are often not relying on the estimation data, but on new data intended to represent key features of the estimation data. They all produce predictions for the number of “interesting values” on your \(x\) variable of interest. So they generally do not produce \(N\) predictions.
Prediction methods
These three approaches are all very flexible/adaptable, and are what we commonly use in linear (OLS) and non-linear (MLE) models. So you’ll see these methods quite in quantitative social science modeling.
At-means predictions (adjusted effects)
At-means predictions (also called “adjusted effects”) set all variables except the \(x\) of interest at a sensible value - usually the mean, median, or mode depending on the level of measurement. Holding those constant, but varying just the \(x\) of interest produces predictions of \(y\) that are neater and easier to discuss than the in-sample ones above.
Summarize the estimation data
Let’s find the means etc. of the variables in the model. It’s important only to consider the cases that are in the estimation data, that is, actually used in the model. Write code to identify cases in the model, then use the means, etc. of these for the predictions:
In this case, it turns out we use all the cases in the data - it’s important to check this any time you’re making model predictions.
Generate at-mean predictions
Create a new data frame with as many observations as there are interesting values of the \(x\) variable of interest. Then, set all variables except the \(x\) of interest at their means, medians, or modes. Using the standard errors of the predictions, generate the boundaries of the confidence intervals by end point transformation.
R’s predict() function does this for us, but it’s important to understand the process. The predict() function generates the predictions and the standard errors of those predictions. The interval="confidence" option tells R to generate the confidence intervals, and the se.fit=TRUE option tells R to generate the standard errors of the predictions.
Aside on standard errors of \(\widehat{y}\)
The standard errors of \(\widehat{y}\) are calculated using the \(x\) values, and the variance-covariance matrix of the coefficients. The maximum likelihood method is as follows:
\[ \sqrt{diag(\mathbf{X V X'})} \]
Where \(X\) is the matrix of \(x\) values (probably the constructed, out of sample \(X\) matrix), \(V\) is the variance-covariance matrix of \(\widehat{\beta}\), and \(X'\) is the transpose of the \(X\) matrix. The standard errors of \(\widehat{y}\) are calculated by taking the square root of the main diagonal of this matrix.
Uncertainty about \(\widehat{y}\)
The intuition is that we’re using our uncertainty about the coefficients to generate measures of uncertainty about the predictions.
Average effects, end point boundaries
Average effects are where we’re computing \(N\) predictions for every interesting value of \(x\), then taking the average of those predictions for each value of \(x\). This is a counterfactual approach - we’re using the actual data, changing only the variable of interest, as if all observations took on the same value of that variable.
For instance, in our MLB attendance model, we change the value of “Runs Scored” to the minimum value, 329, for every observation, keeping all the other variables as they are in the real estimation data. Compute the predictions for all the observations at \(x = 329\), then take the average of those predictions - now iterate to 330, 331, etc.
Counterfactuals
We are treating the estimation data as if every team scored 329 runs - this is the counterfactual - then computing the average attendance for that counterfactual, and doing this for all counterfactuals (values of \(x\)).
It’s important to note we are using the estimation data to compute average effects, looping over values of the variable of interest, then saving the average (median) prediction and the median standard error of that prediction.
code
# preserve the original values of Runs Scoredmlb <- mlb%>%mutate(original_runs_scored = Runs_scored)xb =0se =0rs =0for(i inseq(330,1000,1)){ #iterating by 1 to max number of obs we're taking medians for mlb$Runs_scored <- i mlb.predict <-data.frame(predict(m3, interval="confidence", se.fit=TRUE, level=.05, newdata=mlb)) xb[i-329] <-median(mlb.predict$fit.fit) #index using i but minus constant to start at row 1 se[i-329] <-median(mlb.predict$se.fit) rs[i-329] <- i}avg.pred <-data.frame(xb, se, rs)# upper and lower bounds by end point transformationavg.pred <- avg.pred %>%mutate(ub=xb+1.96*se) %>%mutate(lb=xb-1.96*se)# reset the estimation data to the actual values of Runs Scoredmlb <- mlb%>%mutate(Runs_scored=original_runs_scored )#plotaverage <-ggplot(avg.pred, aes(x=rs, y=xb)) +geom_line() +geom_ribbon(aes(ymin=lb, ymax=ub), alpha=.2) +xlab("Runs Scored") +ylab("Expected Attendance (in thousands)") +ggtitle("Average Effects") +theme_minimal()average
Comparing at-mean and average effects
code
atmean + average
Simulated effects
The last method we’ll consider is simulated effects. This is a way to generate predictions that are based on the distribution of the coefficients. The \(\widehat{\beta}\)s are estimates of the mean of a normal distribution, and the var-cov matrix of the coefficients is an estimate of the variance of that distribution. We can assume the distribution of \(\beta\) is Normal because of the Central Limit Theorem.
Here’s the process. Treat our \(\widehat{\beta}\)s as the mean of a normal distribution, and the var-cov matrix of the coefficients as the variance. Generate a large number of draws from that distribution, say 10,000. This will give us the simulated distribution of \(\widehat{\beta}\). We’ll have 10,000 estimates of \(\widehat{\beta}\).
Using these, we can now produce 10,000 predictions for each interesting value of \(x\). We can then summarize the distribution of those predictions, using the median as our point estimate, and the percentiles (2.5, 97.5) as our confidence boundaries.
code
B <-data.frame(rmvnorm(n=10000, mean=coef(m3), vcov(m3))) #simulated distribution of the estimates using the var-cov matrix of B as the variance, and the estimates of B as the mean.colnames(B) <-c('b0', 'b1', 'b2', 'b3', 'b4')sim.preds <-data.frame ( lb =numeric(0),med=numeric(0),ub=numeric(0), r =numeric(0))for(i inseq(330,1000,1)){ xbR <-quantile(B$b0 + B$b1*i + B$b2*694+ B$b3*1985+B$b4*14, probs=c(.025,.5,.975)) xbR<-data.frame(t(xbR)) sim.preds[(i-329):(i-329),1:4] <-data.frame(xbR, i)}#plotsim <-ggplot(sim.preds, aes(x=r, y=med)) +geom_line() +geom_ribbon(aes(ymin=lb, ymax=ub), alpha=.2) +xlab("Runs Scored") +ylab("Expected Attendance (in thousands)") +ggtitle("Simulated Effects") +theme_minimal()sim
Comparing methods
code
(atmean + average + sim)
Summary
These look quite similar, so what’s the point of these different approaches? Different data, models, and questions will sometimes point you toward a particular method. For instance, fixed effects models will likely push you to average effects. We’ll encounter a variety of circumstances that will make one or another of these methods more or less appropriate both in this class and in the non linear models class.